The dataset used for this analysis is scraped from IMDb, a popular online database of movies, and encompasses a wide range of films released between 1980 and 2020. The inclusion of over 7000 movies in the dataset offers a comprehensive perspective on the movie industry, covering a significant span of four decades. This large sample size enhances the statistical validity of the analysis and enables robust conclusions to be drawn about the relationships between these movie indicators.
Dataset source: https://www.kaggle.com/datasets/danielgrijalvas/movies
This project focuses on exploring the relationships between various movie indicators, specifically budget, gross revenue, score, genre, and number of votes.
By examining the dataset, this project aims to uncover insights into how these movie indicators are connected and how they influence each other. This exploration will provide valuable information on the financial performance, audience reception, and critical acclaim of movies within the given time frame.
Through this project, it is possible to gain insights into how the budget allocated to a movie impacts its financial success, as measured by its gross revenue. Additionally, the analysis aims to determine the connection between the popularity of a movie, as indicated by its gross revenue, and its overall score, which reflects critical reception or audience ratings.
By exploring these relationships, the project intends to provide a deeper understanding of the dynamics within the movie industry and the factors that contribute to a movie's success or failure. The insights derived from this analysis can be valuable for movie studios, filmmakers, and industry professionals in making informed decisions regarding budgeting, marketing strategies, and overall movie production.
#Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import re
import cpi
import matplotlib.pyplot as plt
import matplotlib
#Setting the plotting style to 'ggplot'
plt.style.use('ggplot')
#Setting the figure size
matplotlib.rcParams['figure.figsize'] = (12,8)
#Reading the data
df = pd.read_csv(r'C:\Users\learu\OneDrive\Documents\Portfolio\Movies - Python\movies.csv')
#Querying the data
df
index | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000.0 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000.0 | 46998772.0 | Warner Bros. | 146.0 |
1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000.0 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000.0 | 58853106.0 | Columbia Pictures | 104.0 |
2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000.0 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000.0 | 538375067.0 | Lucasfilm | 124.0 |
3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000.0 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000.0 | 83453539.0 | Paramount Pictures | 88.0 |
4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000.0 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000.0 | 39846344.0 | Orion Pictures | 98.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7663 | More to Life | NaN | Drama | 2020 | October 23, 2020 (United States) | 3.1 | 18.0 | Joseph Ebanks | Joseph Ebanks | Shannon Bond | United States | 7000.0 | NaN | NaN | 90.0 |
7664 | Dream Round | NaN | Comedy | 2020 | February 7, 2020 (United States) | 4.7 | 36.0 | Dusty Dukatz | Lisa Huston | Michael Saquella | United States | NaN | NaN | Cactus Blue Entertainment | 90.0 |
7665 | Saving Mbango | NaN | Drama | 2020 | April 27, 2020 (Cameroon) | 5.7 | 29.0 | Nkanya Nkwai | Lynno Lovert | Onyama Laura | United States | 58750.0 | NaN | Embi Productions | NaN |
7666 | It's Just Us | NaN | Drama | 2020 | October 1, 2020 (United States) | NaN | NaN | James Randall | James Randall | Christina Roz | United States | 15000.0 | NaN | NaN | 120.0 |
7667 | Tee em el | NaN | Horror | 2020 | August 19, 2020 (United States) | 5.7 | 7.0 | Pereko Mosia | Pereko Mosia | Siyabonga Mabaso | South Africa | NaN | NaN | PK 65 Films | 102.0 |
7668 rows × 15 columns
#Checking if there are empty cells
for col in df.columns:
    missing = np.mean(df[col].isnull())
    print('{}: {}%'.format(col, round(missing*100)))
name: 0%
rating: 1%
genre: 0%
year: 0%
released: 0%
score: 0%
votes: 0%
director: 0%
writer: 0%
star: 0%
country: 0%
budget: 28%
gross: 2%
company: 0%
runtime: 0%
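The same per-column check can also be written without an explicit loop. The snippet below is a minimal sketch on a tiny made-up frame (the `sample` data and the `missing_pct` name are assumptions for illustration, not part of the movies dataset); it relies on the fact that the mean of a boolean column is the share of `True` values.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the movies data (illustrative values only)
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "budget": [1_000_000, np.nan, 5_000_000, np.nan],
    "gross": [2_000_000, 3_000_000, np.nan, 4_000_000],
})

# isnull() gives a boolean frame; the column-wise mean is the share of missing cells
missing_pct = (sample.isnull().mean() * 100).round().astype(int)
print(missing_pct)
```

On the real data, replacing `sample` with `df` should give the same percentages as the loop.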
# Dropping rows with no values in 'budget' or 'gross' columns as it will affect the analysis
df.dropna(subset=['budget', 'gross'], inplace=True)
df
index | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000.0 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000.0 | 46998772.0 | Warner Bros. | 146.0 |
1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000.0 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000.0 | 58853106.0 | Columbia Pictures | 104.0 |
2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000.0 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000.0 | 538375067.0 | Lucasfilm | 124.0 |
3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000.0 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000.0 | 83453539.0 | Paramount Pictures | 88.0 |
4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000.0 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000.0 | 39846344.0 | Orion Pictures | 98.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7648 | Bad Boys for Life | R | Action | 2020 | January 17, 2020 (United States) | 6.6 | 140000.0 | Adil El Arbi | Peter Craig | Will Smith | United States | 90000000.0 | 426505244.0 | Columbia Pictures | 124.0 |
7649 | Sonic the Hedgehog | PG | Action | 2020 | February 14, 2020 (United States) | 6.5 | 102000.0 | Jeff Fowler | Pat Casey | Ben Schwartz | United States | 85000000.0 | 319715683.0 | Paramount Pictures | 99.0 |
7650 | Dolittle | PG | Adventure | 2020 | January 17, 2020 (United States) | 5.6 | 53000.0 | Stephen Gaghan | Stephen Gaghan | Robert Downey Jr. | United States | 175000000.0 | 245487753.0 | Universal Pictures | 101.0 |
7651 | The Call of the Wild | PG | Adventure | 2020 | February 21, 2020 (United States) | 6.8 | 42000.0 | Chris Sanders | Michael Green | Harrison Ford | Canada | 135000000.0 | 111105497.0 | 20th Century Studios | 100.0 |
7652 | The Eight Hundred | Not Rated | Action | 2020 | August 28, 2020 (United States) | 6.8 | 3700.0 | Hu Guan | Hu Guan | Zhi-zhong Huang | China | 80000000.0 | 461421559.0 | Beijing Diqi Yinxiang Entertainment | 149.0 |
5436 rows × 15 columns
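Dropping incomplete rows shrinks the sample noticeably, so it helps to quantify the loss. The arithmetic below is a small sketch that uses only the two row counts reported in the outputs above; the variable names are illustrative.

```python
rows_before = 7668  # row count shown right after loading the CSV
rows_after = 5436   # row count shown after dropping rows without budget or gross

dropped = rows_before - rows_after
share_dropped = dropped / rows_before
print(f"Dropped {dropped} rows ({share_dropped:.1%} of the original data)")
```

Roughly three in ten movies are excluded, so all later results describe only the movies for which both budget and gross are known.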
#Checking again if there are empty cells
for col in df.columns:
    missing = np.mean(df[col].isnull())
    print('{}: {}%'.format(col, round(missing*100)))
name: 0%
rating: 0%
genre: 0%
year: 0%
released: 0%
score: 0%
votes: 0%
director: 0%
writer: 0%
star: 0%
country: 0%
budget: 0%
gross: 0%
company: 0%
runtime: 0%
#Extracting the year from the released column
def extract_releaseyear(data):
    # The first standalone 4-digit number in strings such as
    # "June 13, 1980 (United States)" is the release year.
    # The original fallback patterns could never match once this one failed.
    match = re.search(r'\b\d{4}\b', data)
    return int(match.group()) if match else None
df['year_released'] = df['released'].astype(str).apply(extract_releaseyear)
df
index | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime | year_released |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000.0 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000.0 | 46998772.0 | Warner Bros. | 146.0 | 1980 |
1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000.0 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000.0 | 58853106.0 | Columbia Pictures | 104.0 | 1980 |
2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000.0 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000.0 | 538375067.0 | Lucasfilm | 124.0 | 1980 |
3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000.0 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000.0 | 83453539.0 | Paramount Pictures | 88.0 | 1980 |
4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000.0 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000.0 | 39846344.0 | Orion Pictures | 98.0 | 1980 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7648 | Bad Boys for Life | R | Action | 2020 | January 17, 2020 (United States) | 6.6 | 140000.0 | Adil El Arbi | Peter Craig | Will Smith | United States | 90000000.0 | 426505244.0 | Columbia Pictures | 124.0 | 2020 |
7649 | Sonic the Hedgehog | PG | Action | 2020 | February 14, 2020 (United States) | 6.5 | 102000.0 | Jeff Fowler | Pat Casey | Ben Schwartz | United States | 85000000.0 | 319715683.0 | Paramount Pictures | 99.0 | 2020 |
7650 | Dolittle | PG | Adventure | 2020 | January 17, 2020 (United States) | 5.6 | 53000.0 | Stephen Gaghan | Stephen Gaghan | Robert Downey Jr. | United States | 175000000.0 | 245487753.0 | Universal Pictures | 101.0 | 2020 |
7651 | The Call of the Wild | PG | Adventure | 2020 | February 21, 2020 (United States) | 6.8 | 42000.0 | Chris Sanders | Michael Green | Harrison Ford | Canada | 135000000.0 | 111105497.0 | 20th Century Studios | 100.0 | 2020 |
7652 | The Eight Hundred | Not Rated | Action | 2020 | August 28, 2020 (United States) | 6.8 | 3700.0 | Hu Guan | Hu Guan | Zhi-zhong Huang | China | 80000000.0 | 461421559.0 | Beijing Diqi Yinxiang Entertainment | 149.0 | 2020 |
5436 rows × 16 columns
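The year can also be pulled out with pandas' vectorized string methods instead of a row-by-row function. This is a sketch of an alternative, not the approach used above; the three strings are copied from the `released` column, and the variable names are illustrative.

```python
import pandas as pd

# Strings in the same format as the 'released' column
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1980 (United States)",
    "August 28, 2020 (United States)",
])

# Capture the first standalone 4-digit number; expand=False returns a Series
year_released = released.str.extract(r"\b(\d{4})\b", expand=False).astype("int64")
print(year_released.tolist())
```

If some rows contained no year, `str.extract` would return missing values for them and the cast to `int64` would fail, so such rows would need to be handled first.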
#Querying data types of each column
df.dtypes
name              object
rating            object
genre             object
year               int64
released          object
score            float64
votes            float64
director          object
writer            object
star              object
country           object
budget           float64
gross            float64
company           object
runtime          float64
year_released      int64
dtype: object
#Changing the data type of some columns for readability
df['votes'] = df['votes'].astype('int64')
df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].astype('int64')
df['year_released'] = df['year_released'].astype('int64')
df
index | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime | year_released |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000 | 46998772 | Warner Bros. | 146.0 | 1980 |
1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000 | 58853106 | Columbia Pictures | 104.0 | 1980 |
2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000 | 538375067 | Lucasfilm | 124.0 | 1980 |
3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000 | 83453539 | Paramount Pictures | 88.0 | 1980 |
4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000 | 39846344 | Orion Pictures | 98.0 | 1980 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7648 | Bad Boys for Life | R | Action | 2020 | January 17, 2020 (United States) | 6.6 | 140000 | Adil El Arbi | Peter Craig | Will Smith | United States | 90000000 | 426505244 | Columbia Pictures | 124.0 | 2020 |
7649 | Sonic the Hedgehog | PG | Action | 2020 | February 14, 2020 (United States) | 6.5 | 102000 | Jeff Fowler | Pat Casey | Ben Schwartz | United States | 85000000 | 319715683 | Paramount Pictures | 99.0 | 2020 |
7650 | Dolittle | PG | Adventure | 2020 | January 17, 2020 (United States) | 5.6 | 53000 | Stephen Gaghan | Stephen Gaghan | Robert Downey Jr. | United States | 175000000 | 245487753 | Universal Pictures | 101.0 | 2020 |
7651 | The Call of the Wild | PG | Adventure | 2020 | February 21, 2020 (United States) | 6.8 | 42000 | Chris Sanders | Michael Green | Harrison Ford | Canada | 135000000 | 111105497 | 20th Century Studios | 100.0 | 2020 |
7652 | The Eight Hundred | Not Rated | Action | 2020 | August 28, 2020 (United States) | 6.8 | 3700 | Hu Guan | Hu Guan | Zhi-zhong Huang | China | 80000000 | 461421559 | Beijing Diqi Yinxiang Entertainment | 149.0 | 2020 |
5436 rows × 16 columns
#Adjusting the value of the budget and gross column considering inflation using the cpi library
def inflation_adjust(data, column):
    return data.apply(lambda x: cpi.inflate(x[column], x.year_released), axis=1)
df['budget_inflation_adjust'] = inflation_adjust(df, 'budget')
df['gross_inflation_adjust'] = inflation_adjust(df, 'gross')
df
index | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime | year_released | budget_inflation_adjust | gross_inflation_adjust |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000 | 46998772 | Warner Bros. | 146.0 | 1980 | 6.748113e+07 | 1.669226e+08 |
1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000 | 58853106 | Columbia Pictures | 104.0 | 1980 | 1.598237e+07 | 2.090249e+08 |
2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000 | 538375067 | Lucasfilm | 124.0 | 1980 | 6.392949e+07 | 1.912114e+09 |
3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000 | 83453539 | Paramount Pictures | 88.0 | 1980 | 1.243073e+07 | 2.963968e+08 |
4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000 | 39846344 | Orion Pictures | 98.0 | 1980 | 2.130983e+07 | 1.415198e+08 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7648 | Bad Boys for Life | R | Action | 2020 | January 17, 2020 (United States) | 6.6 | 140000 | Adil El Arbi | Peter Craig | Will Smith | United States | 90000000 | 426505244 | Columbia Pictures | 124.0 | 2020 | 1.017691e+08 | 4.822782e+08 |
7649 | Sonic the Hedgehog | PG | Action | 2020 | February 14, 2020 (United States) | 6.5 | 102000 | Jeff Fowler | Pat Casey | Ben Schwartz | United States | 85000000 | 319715683 | Paramount Pictures | 99.0 | 2020 | 9.611522e+07 | 3.615240e+08 |
7650 | Dolittle | PG | Adventure | 2020 | January 17, 2020 (United States) | 5.6 | 53000 | Stephen Gaghan | Stephen Gaghan | Robert Downey Jr. | United States | 175000000 | 245487753 | Universal Pictures | 101.0 | 2020 | 1.978843e+08 | 2.775895e+08 |
7651 | The Call of the Wild | PG | Adventure | 2020 | February 21, 2020 (United States) | 6.8 | 42000 | Chris Sanders | Michael Green | Harrison Ford | Canada | 135000000 | 111105497 | 20th Century Studios | 100.0 | 2020 | 1.526536e+08 | 1.256345e+08 |
7652 | The Eight Hundred | Not Rated | Action | 2020 | August 28, 2020 (United States) | 6.8 | 3700 | Hu Guan | Hu Guan | Zhi-zhong Huang | China | 80000000 | 461421559 | Beijing Diqi Yinxiang Entertainment | 149.0 | 2020 | 9.046138e+07 | 5.217604e+08 |
5436 rows × 18 columns
#Converting the inflation-adjusted columns to integers so they display in full rather than in scientific notation
df['budget_inflation_adjust'] = df['budget_inflation_adjust'].astype('int64')
df['gross_inflation_adjust'] = df['gross_inflation_adjust'].astype('int64')
df
index | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime | year_released | budget_inflation_adjust | gross_inflation_adjust |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000 | 46998772 | Warner Bros. | 146.0 | 1980 | 67481128 | 166922641 |
1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000 | 58853106 | Columbia Pictures | 104.0 | 1980 | 15982372 | 209024948 |
2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000 | 538375067 | Lucasfilm | 124.0 | 1980 | 63929490 | 1912113534 |
3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000 | 83453539 | Paramount Pictures | 88.0 | 1980 | 12430734 | 296396789 |
4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000 | 39846344 | Orion Pictures | 98.0 | 1980 | 21309830 | 141519803 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7648 | Bad Boys for Life | R | Action | 2020 | January 17, 2020 (United States) | 6.6 | 140000 | Adil El Arbi | Peter Craig | Will Smith | United States | 90000000 | 426505244 | Columbia Pictures | 124.0 | 2020 | 101769051 | 482278157 |
7649 | Sonic the Hedgehog | PG | Action | 2020 | February 14, 2020 (United States) | 6.5 | 102000 | Jeff Fowler | Pat Casey | Ben Schwartz | United States | 85000000 | 319715683 | Paramount Pictures | 99.0 | 2020 | 96115215 | 361524020 |
7650 | Dolittle | PG | Adventure | 2020 | January 17, 2020 (United States) | 5.6 | 53000 | Stephen Gaghan | Stephen Gaghan | Robert Downey Jr. | United States | 175000000 | 245487753 | Universal Pictures | 101.0 | 2020 | 197884266 | 277589508 |
7651 | The Call of the Wild | PG | Adventure | 2020 | February 21, 2020 (United States) | 6.8 | 42000 | Chris Sanders | Michael Green | Harrison Ford | Canada | 135000000 | 111105497 | 20th Century Studios | 100.0 | 2020 | 152653577 | 125634456 |
7652 | The Eight Hundred | Not Rated | Action | 2020 | August 28, 2020 (United States) | 6.8 | 3700 | Hu Guan | Hu Guan | Zhi-zhong Huang | China | 80000000 | 461421559 | Beijing Diqi Yinxiang Entertainment | 149.0 | 2020 | 90461379 | 521760382 |
5436 rows × 18 columns
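A CPI-based adjustment multiplies a nominal amount by the ratio of the price index in the target year to the price index in the original year. The sketch below checks that the figures in the table are consistent with this, assuming the `cpi` library applies one ratio per release year; it uses only the values shown for The Shining, and the variable names are illustrative.

```python
# Values for The Shining (1980), copied from the table above
budget_nominal = 19_000_000
budget_adjusted = 67_481_128
gross_nominal = 46_998_772
gross_adjusted = 166_922_641

# Implied price-level ratio between 1980 and the target year
ratio = budget_adjusted / budget_nominal

# Applying the same ratio to the nominal gross should land near the adjusted gross
gross_check = gross_nominal * ratio
print(round(ratio, 4), round(gross_check), gross_adjusted)
```

The two gross figures differ only by a few dollars, which is what integer truncation of the adjusted budget would produce.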
#Deleting duplicate rows, if any. In this case, there are no duplicate rows.
df = df.drop_duplicates()
#Sorting from highest to lowest gross
df_sorted = df.sort_values(by=['gross_inflation_adjust'], inplace=False, ascending=False)
df_sorted
index | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime | year_released | budget_inflation_adjust | gross_inflation_adjust |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3045 | Titanic | PG-13 | Drama | 1997 | December 19, 1997 (United States) | 7.8 | 1100000 | James Cameron | James Cameron | Leonardo DiCaprio | United States | 200000000 | 2201647264 | Twentieth Century Fox | 194.0 | 1997 | 364679127 | 4014474018 |
5445 | Avatar | PG-13 | Action | 2009 | December 18, 2009 (United States) | 7.8 | 1100000 | James Cameron | James Cameron | Sam Worthington | United States | 237000000 | 2847246203 | Twentieth Century Fox | 162.0 | 2009 | 323297310 | 3883995942 |
7445 | Avengers: Endgame | PG-13 | Action | 2019 | April 26, 2019 (United States) | 8.4 | 903000 | Anthony Russo | Christopher Markus | Robert Downey Jr. | United States | 356000000 | 2797501328 | Marvel Studios | 181.0 | 2019 | 407519371 | 3202348267 |
6663 | Star Wars: Episode VII - The Force Awakens | PG-13 | Action | 2015 | December 18, 2015 (United States) | 7.8 | 876000 | J.J. Abrams | Lawrence Kasdan | Daisy Ridley | United States | 245000000 | 2069521700 | Lucasfilm | 138.0 | 2015 | 302511950 | 2555326719 |
209 | E.T. the Extra-Terrestrial | PG | Family | 1982 | June 11, 1982 (United States) | 7.8 | 381000 | Steven Spielberg | Melissa Mathison | Henry Thomas | United States | 10500000 | 792910554 | Universal Pictures | 115.0 | 1982 | 31843290 | 2404655317 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5640 | Tanner Hall | R | Drama | 2009 | January 15, 2015 (Sweden) | 5.8 | 3500 | Francesca Gregorini | Tatiana von Fürstenberg | Rooney Mara | United States | 3000000 | 5073 | Two Prong Lesson | 96.0 | 2015 | 3704227 | 6263 |
2434 | Philadelphia Experiment II | PG-13 | Action | 1993 | June 4, 1994 (South Korea) | 4.5 | 1900 | Stephen Cornwell | Wallace C. Bennett | Brad Johnson | United States | 5000000 | 2970 | Trimark Pictures | 97.0 | 1994 | 9873650 | 5864 |
3681 | Ginger Snaps | Not Rated | Drama | 2000 | May 11, 2001 (Canada) | 6.8 | 43000 | John Fawcett | Karen Walton | Emily Perkins | Canada | 5000000 | 2554 | Copperheart Entertainment | 108.0 | 2001 | 8262422 | 4220 |
2417 | Madadayo | NaN | Drama | 1993 | April 17, 1993 (Japan) | 7.3 | 5100 | Akira Kurosawa | Ishirô Honda | Tatsuo Matsumura | Japan | 11900000 | 596 | DENTSU Music And Entertainment | 134.0 | 1993 | 24100999 | 1207 |
3203 | Trojan War | PG-13 | Comedy | 1997 | October 1, 1997 (Brazil) | 5.7 | 5800 | George Huang | Andy Burg | Will Friedle | United States | 15000000 | 309 | Daybreak | 85.0 | 1997 | 27350934 | 563 |
5436 rows × 18 columns
#Correlation Matrix between selected numeric columns
selected_numeric = ['year_released','score','votes','budget_inflation_adjust','gross_inflation_adjust','runtime']
correlation_matrix = df[selected_numeric].corr(method='pearson')
correlation_matrix
feature | year_released | score | votes | budget_inflation_adjust | gross_inflation_adjust | runtime |
---|---|---|---|---|---|---|
year_released | 1.000000 | 0.061029 | 0.202883 | 0.159099 | 0.159096 | 0.074432 |
score | 0.061029 | 1.000000 | 0.473809 | 0.059347 | 0.242249 | 0.414580 |
votes | 0.202883 | 0.473809 | 1.000000 | 0.421948 | 0.635112 | 0.352437 |
budget_inflation_adjust | 0.159099 | 0.059347 | 0.421948 | 1.000000 | 0.677118 | 0.339182 |
gross_inflation_adjust | 0.159096 | 0.242249 | 0.635112 | 0.677118 | 1.000000 | 0.282979 |
runtime | 0.074432 | 0.414580 | 0.352437 | 0.339182 | 0.282979 | 1.000000 |
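For reference, each entry in the matrix above is a Pearson coefficient: the covariance of two columns divided by the product of their standard deviations. The sketch below reproduces that calculation by hand on made-up numbers (the `x` and `y` values are illustrative only) and compares it with pandas.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r written with deviations from the mean
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)  # pandas defaults to method='pearson'
print(r_manual, r_pandas)
```

Both values come out at 0.8 for this toy data, up to floating-point rounding.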
#Creating a heatmap to visualize the correlation matrix
sns.heatmap(correlation_matrix, annot = True, cmap="mako")
plt.title("Correlation matrix of Numeric Features")
plt.show()
#Highest correlation is between gross and budget
#Plotting a scatter plot with a linear regression line using seaborn
sns.regplot(x="budget_inflation_adjust", y="gross_inflation_adjust", data=df, scatter_kws={"color":"SteelBlue"}, line_kws={"color":"DarkSeaGreen"})
#Calculating the correlation coefficient and rounding to 6 decimal places
corr_coeff1 = round(df['budget_inflation_adjust'].corr(df['gross_inflation_adjust']), 6)
#Creating the annotation text
text = f'Correlation Coefficient: {corr_coeff1}'
#Adding the annotation to the plot
plt.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction')
#Adding labels
plt.title('Gross vs. Budget')
plt.xlabel("Budget (inflation adjusted)")
plt.ylabel("Gross (inflation adjusted)")
#Displaying the plot
plt.show()
The correlation coefficient of 0.677118 suggests a moderate to strong positive correlation between a movie's gross revenue and its budget. This correlation implies that, on average, movies with higher budgets have a higher likelihood of generating higher gross revenue.
It is important to note that correlation does not necessarily imply causation. While a higher budget may contribute to a movie's success and marketing efforts, there are other factors at play that can influence a film's gross revenue, such as the quality of the script, direction, acting, marketing strategy, competition, release timing, and audience reception.
However, a positive correlation between gross and budget suggests that investing more resources into a movie's production and marketing may increase the chances of generating higher revenue. Larger budgets often allow for better production values, elaborate visual effects, renowned actors, and extensive marketing campaigns, all of which can attract a wider audience and potentially result in higher ticket sales and other revenue streams such as merchandise and licensing.
It is important to consider that this correlation may not hold true for every movie. There will always be exceptions where movies with lower budgets perform exceptionally well at the box office, and vice versa. Additionally, other factors like genre, target audience, critical reception, and competition within the industry can significantly impact a movie's financial success.
#2nd highest correlation is between number of votes and gross
#Plotting a scatter plot with a linear regression line using seaborn
sns.regplot(x="gross_inflation_adjust", y="votes", data=df, scatter_kws={"color":"SteelBlue"}, line_kws={"color":"DarkSeaGreen"})
#Calculating the correlation coefficient and rounding to 6 decimal places
corr_coeff2 = round(df['gross_inflation_adjust'].corr(df['votes']), 6)
#Creating the annotation text
text = f'Correlation Coefficient: {corr_coeff2}'
#Adding the annotation to the plot
plt.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction')
#Adding labels
plt.title('No. of Votes vs. Gross')
plt.xlabel("Gross (inflation adjusted)")
plt.ylabel("No. of Votes")
#Displaying the plot
plt.show()
The correlation coefficient of 0.635112 between a movie's gross revenue and the number of votes it receives indicates a moderate positive relationship: movies with higher gross revenue tend to receive more votes.
One plausible reading is that movies that perform well financially reach a larger audience, and a larger audience casts more votes. In that sense, the number of votes reflects audience engagement, which moves together with financial success.
However, it's important to note that correlation does not imply causation. While the correlation suggests that movies with higher gross revenue tend to receive more votes, other factors can also influence this relationship. For example, factors such as marketing efforts, genre appeal, critical reception, release timing, and overall quality of the film can all impact both the gross revenue and the number of votes a movie accumulates.
#3rd highest correlation is between number of votes and score
#Plotting a scatter plot with a linear regression line using seaborn
sns.regplot(x='votes', y='score', data=df, scatter_kws={"color":"SteelBlue"}, line_kws={"color":"DarkSeaGreen"})
#Calculating the correlation coefficient and rounding to 6 decimal places
corr_coeff3 = round(df['votes'].corr(df['score']), 6)
#Creating the annotation text
text = f'Correlation Coefficient: {corr_coeff3}'
#Adding the annotation to the plot
plt.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction')
#Adding labels
plt.title('Score vs. No. of Votes')
plt.xlabel('No. of Votes')
plt.ylabel('Score')
#Displaying the plot
plt.show()
The correlation between score and number of votes (0.473809) indicates a weak positive relationship: movies that receive more votes tend to have somewhat higher scores.
However, a coefficient of this size means that the number of votes is not a strong predictor of the score, nor the score of the number of votes. A movie can receive a large number of votes and still have a low score, and a highly rated movie can receive few votes. Other factors that may influence a movie's score, such as genre, plot, acting, and direction, should therefore also be considered.
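One way to make labels such as "weak" and "moderate" more concrete is to square each coefficient: for a straight-line fit, r squared is the share of variance in one variable accounted for by the other. The sketch below applies this to the coefficients reported in the correlation matrix; the dictionary names are illustrative.

```python
# Pearson coefficients reported in the correlation matrix
coefficients = {
    "budget vs. gross": 0.677118,
    "votes vs. gross": 0.635112,
    "votes vs. score": 0.473809,
    "score vs. gross": 0.242249,
}

# r squared is the share of variance a straight-line fit would account for
r_squared = {pair: r ** 2 for pair, r in coefficients.items()}
for pair, value in r_squared.items():
    print(f"{pair}: r^2 = {value:.2f}")
```

Votes account for roughly a fifth of the variation in score, while budget accounts for a little under half of the variation in gross.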
#Summarizing by Genre
bygenre = df.groupby('genre').agg({
    'name': 'count',
    'score': 'mean',
    'budget_inflation_adjust': 'mean',
    'gross_inflation_adjust': 'mean'
})
#Converting the mean budget and gross to integers so they display in full
bygenre['budget_inflation_adjust'] = bygenre['budget_inflation_adjust'].astype('int64')
bygenre['gross_inflation_adjust'] = bygenre['gross_inflation_adjust'].astype('int64')
#Renaming column names
bygenre = bygenre.rename(columns={'name': 'movie count', 'score': 'mean score', 'budget_inflation_adjust': 'mean budget', 'gross_inflation_adjust': 'mean gross'})
#Sorting by mean score
bygenre_sorted = bygenre.sort_values('mean score',ascending=False)
#Displaying sorted data
bygenre_sorted
genre | movie count | mean score | mean budget | mean gross |
---|---|---|---|---|
Biography | 312 | 7.084936 | 38713742 | 90650204 |
Drama | 869 | 6.723590 | 38210333 | 97235517 |
Animation | 278 | 6.695683 | 106559080 | 386520569 |
Crime | 400 | 6.690250 | 38180888 | 80763881 |
Family | 4 | 6.675000 | 71147707 | 985405770 |
Mystery | 17 | 6.670588 | 51723053 | 185561825 |
Romance | 5 | 6.580000 | 40070013 | 52899276 |
Sci-Fi | 6 | 6.350000 | 43117574 | 55340360 |
Adventure | 327 | 6.268196 | 71892995 | 200162330 |
Action | 1417 | 6.247212 | 87010877 | 246321671 |
Comedy | 1496 | 6.190709 | 38035571 | 97351092 |
Fantasy | 42 | 6.004762 | 31609345 | 70088598 |
Western | 2 | 5.950000 | 26806289 | 21153639 |
Thriller | 7 | 5.928571 | 23029084 | 64544680 |
Horror | 254 | 5.825197 | 21679940 | 83787647 |
Although its mean budget and mean gross fall in the middle range compared to other genres, the Biography genre has the highest mean score. This suggests that a movie's popularity, as reflected by its gross, does not necessarily correlate with its score, which is consistent with the weak correlation coefficient of 0.242249 between score and gross in the correlation matrix.
On the other hand, the Family genre showcases the highest mean gross, which can be attributed to its broader target audience compared to other genres. However, it is important to note that this finding is based on the analysis of only four Family movies included in the dataset, indicating the need for further studies to validate this observation.
Furthermore, Animation movies have the highest mean budget and the second highest mean gross. This suggests that the genre invests significantly in production costs, which potentially contributes to its financial success.
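Because genres such as Family are represented by only a handful of movies, one option is to rank only genres with enough movies to make a mean meaningful. The sketch below rebuilds the movie counts and mean gross from the genre table and applies a cut-off of 30 movies; that threshold and the names `MIN_MOVIES` and `well_sampled` are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Movie counts and mean gross copied from the genre table above
genres = ["Biography", "Drama", "Animation", "Crime", "Family", "Mystery", "Romance",
          "Sci-Fi", "Adventure", "Action", "Comedy", "Fantasy", "Western", "Thriller",
          "Horror"]
bygenre = pd.DataFrame(
    {
        "movie count": [312, 869, 278, 400, 4, 17, 5, 6, 327, 1417, 1496, 42, 2, 7, 254],
        "mean gross": [90650204, 97235517, 386520569, 80763881, 985405770, 185561825,
                       52899276, 55340360, 200162330, 246321671, 97351092, 70088598,
                       21153639, 64544680, 83787647],
    },
    index=genres,
)

MIN_MOVIES = 30  # hypothetical cut-off, not taken from the original analysis

# Keep only genres with enough movies, then find the highest mean gross among them
well_sampled = bygenre[bygenre["movie count"] >= MIN_MOVIES]
top_genre = well_sampled["mean gross"].idxmax()
print(len(well_sampled), top_genre)
```

With this cut-off, nine genres remain and Animation has the highest mean gross among them.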
#Plotting the number of movies by genre with a score of at least 8
#Filtering movies with a score of at least 8
highscore_df = df[df['score'] >= 8]
#Calculating the movie count by genre
genre_counts = highscore_df['genre'].value_counts()
#Generating a purple to blue gradient colormap
cmap = plt.colormaps['PuBu']
#Normalizing the data for mapping to colormap
norm = plt.Normalize(np.min(genre_counts.values), np.max(genre_counts.values))
#Creating a bar graph
plt.bar(genre_counts.index, genre_counts.values, color=cmap(norm(genre_counts.values)))
#Setting the labels and title
plt.xlabel('Genre')
plt.ylabel('Movie Count')
plt.title('Movie Count by Genre with Score >= 8')
#Displaying the bar graph
plt.show()
In contrast to its ranking by mean gross in the previous table, the Drama genre has the largest number of highly rated movies. This further supports the previous observation that a movie's popularity, as reflected by its gross, does not necessarily correlate with its score.
Additionally, despite having the highest movie count in this dataset, the Comedy genre has relatively few movies with a high score. This suggests that while Comedy movies are abundant, only a few of them attain a high score. One possible explanation is that the Comedy genre focuses on entertainment value rather than on profound storytelling or cinematic excellence.
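Raw counts favor genres that simply contain more movies, so a complementary view is the share of each genre's movies that reach a score of 8. The sketch below shows the calculation on made-up scores (the `movies` frame is illustrative only and not drawn from the dataset).

```python
import pandas as pd

# Made-up scores, for illustration only
movies = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy", "Comedy", "Comedy", "Comedy"],
    "score": [8.4, 6.1, 8.0, 5.5, 6.3, 7.1],
})

# Flag high scorers, then average the flag within each genre to get a share
high_share = (movies["score"] >= 8).groupby(movies["genre"]).mean()
print(high_share)
```

In this toy data each genre has one high scorer, yet the share is 0.5 for Drama and 0.25 for Comedy. Applying the same expression to `df` would show whether the Comedy result holds after accounting for genre size.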
In summary, there is a moderate to strong positive correlation (0.677118) between a movie's gross revenue and its budget: on average, movies with higher budgets generate higher gross revenue. There is a moderate positive correlation (0.635112) between gross revenue and the number of votes, so movies with higher gross revenue tend to receive more votes. The relationship between score and number of votes is weak and positive (0.473809): movies with more votes tend to score somewhat higher, but the number of votes does not strongly determine the score. Finally, a movie's popularity, as reflected by its gross, does not necessarily correlate with its score, as indicated by the weak correlation coefficient of 0.242249.
While these measures provide insight into the performance of different genres, correlations alone do not explain why a movie succeeds. A fuller understanding of a movie's overall impact and quality requires more comprehensive analysis and additional factors, such as critical acclaim and audience reception.